{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/experimental/semantic-search-intro/tfidf.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/experimental/semantic-search-intro/tfidf.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TF-IDF\n",
    "\n",
    "TF-IDF is one of the best known methods for text focused search. In this notebook we'll explore how it works, and implement it in Python. Let's start by creating a few example sentences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "a = \"purple is the best city in the forest\".split()\n",
    "b = \"there is an art to getting your way and throwing bananas on to the street is not it\".split()\n",
    "c = \"it is not often you find soggy bananas on the street\".split()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To calculate the TF-IDF for a given word (the query) and a sentence (the document), we calculate the **T**erm **F**requency (**TF**), and the **I**nverse **D**ocument **F**requency (**IDF**)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# we'll merge all docs into a list of lists for easier calculations below\n",
    "docs = [a, b, c]\n",
    "\n",
    "def tfidf(word, sentence):\n",
    "    # term frequency\n",
    "    tf = sentence.count(word) / len(sentence)\n",
    "    # inverse document frequency\n",
    "    idf = np.log10(len(docs) / sum([1 for doc in docs if word in doc]))\n",
    "    return round(tf*idf, 4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's calculate the score between each sentence *a*, *b*, and *c* - against the word *'forest'*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0596"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tfidf('forest', a)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tfidf('forest', b)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tfidf('forest', c)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We would expect sentence *a* to score highest (as the only sentence that contains the word), and that is what we see above.\n",
    "\n",
    "---\n",
    "\n",
    "## Vectors\n",
    "\n",
    "Now, calculating TF-IDF vectors is slightly different. It requires that we compute the TF-IDF score for all words within our document vocabulary (a set of all words across all documents), and we compute that for each sentence (document) - producing document-specific TF-IDF vectors.\n",
    "\n",
    "Let's start by creating a vocabulary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "vocab = set(a+b+c)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now we create the vectors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0596, 0.0, 0.0, 0.0, 0.0596, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0596, 0.0, 0.0, 0.0596, 0.0, 0.0596, 0.0]\n"
     ]
    }
   ],
   "source": [
    "# initialize vectors\n",
    "vec_a = []\n",
    "vec_b = []\n",
    "vec_c = []\n",
    "\n",
    "for word in vocab:\n",
    "    vec_a.append(tfidf(word, a))\n",
    "    vec_b.append(tfidf(word, b))\n",
    "    vec_c.append(tfidf(word, c))\n",
    "\n",
    "print(vec_a)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.0, 0.0, 0.0265, 0.0265, 0.0098, 0.0, 0.053, 0.0, 0.0265, 0.0265, 0.0, 0.0, 0.0, 0.0098, 0.0098, 0.0098, 0.0, 0.0098, 0.0, 0.0265, 0.0265, 0.0, 0.0265, 0.0, 0.0265]\n"
     ]
    }
   ],
   "source": [
    "print(vec_b)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.0434, 0.0, 0.0, 0.0, 0.016, 0.0434, 0.0, 0.0, 0.0, 0.0, 0.0434, 0.0, 0.0434, 0.016, 0.016, 0.016, 0.0, 0.016, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]\n"
     ]
    }
   ],
   "source": [
    "print(vec_c)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13 (default, Mar 28 2022, 06:59:08) [MSC v.1916 64 bit (AMD64)]"
  },
  "orig_nbformat": 2,
  "vscode": {
   "interpreter": {
    "hash": "5fe10bf018ef3e697f9035d60bf60847932a12bface18908407fd371fe880db9"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
